Drawing Sound Conclusions from Noisy Judgments
Abstract
The quality of a search engine is typically evaluated using hand-labeled data sets, where the labels indicate the relevance of documents to queries. Often the number of labels needed is too large to be created by the best annotators, so less accurate labels (e.g., from crowdsourcing) must be used. This introduces errors in the labels, and thus errors in standard precision metrics (such as P@k and DCG): the lower the quality of the judge, the more error-prone the labels and, consequently, the more inaccurate the metric. We introduce equations and algorithms that can adjust the metrics to the values they would have had if there were no annotation errors. This is especially important when two search engines are compared through their metrics. We give examples where one engine appeared to be statistically significantly better than the other, but the effect disappeared after the metrics were corrected for annotation error. In other words, the evidence supporting a statistical difference was illusory, caused by a failure to account for annotation error.
CCS Concepts: • Information systems → Presentation of retrieval results
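To make the idea of correcting a metric for annotation error concrete, here is a minimal sketch for the simplest case, binary relevance and P@k. It is not the paper's equations or algorithms, which the abstract does not spell out. It assumes a judge characterized by two rates that are hypothetical parameters here: `tpr` (the probability of labeling a truly relevant document as relevant) and `fpr` (the probability of labeling a non-relevant document as relevant). Under that assumption the expected observed precision is `p_true * tpr + (1 - p_true) * fpr`, which can be inverted to estimate the true precision.

```python
def observed_precision(true_precision: float, tpr: float, fpr: float) -> float:
    """Expected precision measured by a noisy judge (illustrative model).

    Assumes each label is flipped independently: relevant documents are
    marked relevant with probability `tpr`, non-relevant ones with
    probability `fpr`.
    """
    return true_precision * tpr + (1.0 - true_precision) * fpr


def corrected_precision(observed: float, tpr: float, fpr: float) -> float:
    """Invert the noise model above to estimate the error-free precision.

    The estimate is clipped to [0, 1] because sampling noise can push the
    raw value slightly outside the valid range.
    """
    if tpr <= fpr:
        raise ValueError("judge must be better than chance (tpr > fpr)")
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical judge quality, chosen only for illustration.
    tpr, fpr = 0.9, 0.2
    for true_p in (0.60, 0.65):
        obs = observed_precision(true_p, tpr, fpr)
        est = corrected_precision(obs, tpr, fpr)
        print(f"true={true_p:.2f} observed={obs:.3f} corrected={est:.3f}")
```

Under this model the noisy judge shrinks the gap between two engines by a factor of `tpr - fpr`, and the correction divides by the same factor, which also inflates the variance of the estimate. This is one way an apparently significant difference can lose significance once annotation error is taken into account.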